Realtime Speech-to-Text API

Real-time speech-to-text streaming over a single WebSocket connection.

This endpoint is a dedicated transcription stream: you push raw audio frames in and receive incremental transcriptions and, optionally, translations back as JSON.

Step 1. Get API credentials

Get your Client ID and Client Secret from the Palabra API keys section.

Step 2. Create a session

Exchange your credentials for a short-lived token by calling POST /session-storage/session. The data.publisher field in the response is the token you pass when connecting.

import requests

def get_token(client_id: str, client_secret: str) -> str:
    resp = requests.post(
        "https://api.palabra.ai/session-storage/session",
        json={"data": {}},
        headers={"ClientId": client_id, "ClientSecret": client_secret},
    )
    resp.raise_for_status()
    return resp.json()["data"]["publisher"]

Step 3. Connect

Open a WebSocket to the endpoint below. The token from Step 2 must be passed as the token query parameter. All other stream settings are passed as query parameters in the same URL.

wss://api.palabra.ai/asr/v1/speech-to-text/stream?token=<TOKEN>&language=en&format=pcm_s16le&sample_rate=16000

import websockets

url = (
    "wss://api.palabra.ai/asr/v1/speech-to-text/stream"
    f"?token={token}&language=en&format=pcm_s16le&sample_rate=16000"
)
ws = await websockets.connect(url)

Query parameters

Parameter	Required	Description
`token`	yes	Session token
`format`	yes	Audio format (see Audio formats)
`sample_rate`	conditional	Sample rate in Hz. Required for raw PCM and G.711 formats; for `pcm_s16le` it may be omitted when the rate is 16000. Not used for container formats
`language`	no	Source language code. Defaults to auto
`translate_languages`	no	Comma-separated target languages, e.g. `es,de,fr`
`enable_filler_filter`	no	Whether to enable the filler filter. `true` by default for all languages except `ja`
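
As the number of parameters grows, assembling the URL by hand becomes error-prone. The following is a minimal sketch of a URL builder using only the standard library. The helper name `build_stream_url` and its keyword arguments are illustrative and not part of the API, and the `"false"` spelling for `enable_filler_filter` is an assumption (only `true` is shown in the table above).

```python
from urllib.parse import urlencode

WS_URL = "wss://api.palabra.ai/asr/v1/speech-to-text/stream"


def build_stream_url(token, audio_format="pcm_s16le", sample_rate=None,
                     language=None, translate_languages=None,
                     enable_filler_filter=None):
    """Assemble the stream URL, leaving out optional parameters that are not set."""
    params = {"token": token, "format": audio_format}
    if sample_rate is not None:
        params["sample_rate"] = str(sample_rate)
    if language is not None:
        params["language"] = language
    if translate_languages:
        # Comma-separated list of target languages, e.g. es,de,fr
        params["translate_languages"] = ",".join(translate_languages)
    if enable_filler_filter is not None:
        # Assumption: the server accepts lowercase "true" / "false".
        params["enable_filler_filter"] = "true" if enable_filler_filter else "false"
    # urlencode percent-escapes reserved characters; commas are kept literal
    # so the URL matches the documented form.
    return f"{WS_URL}?{urlencode(params, safe=',')}"


url = build_stream_url("TOKEN", sample_rate=16000, language="en",
                       translate_languages=["es", "de"])
print(url)
```

Escaping the values matters mainly for the token, which may contain characters that are not safe in a query string.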

Step 4. Send audio

Send audio as raw binary WebSocket frames. Chunks of 320 ms are recommended.

await ws.send(data)

Audio formats

`format`	`sample_rate`	Notes
`pcm_s16le`	only if ≠ 16000	16-bit signed little-endian PCM. Recommended
`pcm_f32le` / `pcm_f32be`	required	32-bit float PCM
`pcm_s32le` / `pcm_s32be`	required	32-bit signed PCM
`mulaw` / `alaw`	required	G.711
`webm` / `mp3` / `aac` / `ogg` / `flac` / `wav`	not used	Container formats; rate is read from the stream
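
For the raw formats, the recommended 320 ms chunk translates into a byte count that depends on the sample width and rate. The sketch below shows the arithmetic and a simple splitter for pre-recorded audio. It assumes mono audio (the channel count is not specified above), and the names `chunk_size_bytes` and `split_chunks` are illustrative.

```python
# Bytes per sample for each raw format listed in the table above.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,
    "pcm_f32le": 4, "pcm_f32be": 4,
    "pcm_s32le": 4, "pcm_s32be": 4,
    "mulaw": 1, "alaw": 1,
}


def chunk_size_bytes(audio_format, sample_rate, chunk_ms=320, channels=1):
    """Size in bytes of one chunk_ms-long chunk of raw audio."""
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[audio_format] * channels


def split_chunks(audio, size):
    """Yield consecutive slices of `audio`; the last one may be shorter."""
    for offset in range(0, len(audio), size):
        yield audio[offset:offset + size]


size = chunk_size_bytes("pcm_s16le", 16000)  # 5120 samples * 2 bytes
print(size)
```

Each slice produced by `split_chunks` can then be sent as one binary frame. Container formats are not covered here because their chunking is governed by the container, not by a fixed sample width.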

Step 5. Receive messages

All server-to-client messages are JSON text frames. Switch on message_type.

`transcription`

Emitted continuously as speech is recognized.

{
  "message_type": "transcription",
  "transcription_id": "a1b2c3d4",
  "language": "en",
  "is_eos": false,
  "segment": {
    "text": "Hello world how are",
    "start_time": 0.32,
    "end_time": 1.84
  },
  "delta": {
    "text": "how are",
    "start_time": 1.20,
    "end_time": 1.84
  }
}

Field	Description
`transcription_id`	Stable id for the segment. All messages of one segment share the same id. A new id means a new segment has started
`language`	Detected (or configured) source language of this segment
`is_eos`	`false` — partial; the segment is still being updated. `true` — the segment is committed and final
`segment.text`	The full text of the segment so far
`segment.start_time` / `end_time`	Segment timing, in seconds relative to session start
`delta`	Incremental hint: the text added since the previous partial of the same segment (see below)

Working with `delta`

When the filler filter is disabled, delta.text is append-only: each transcription message carries exactly the text appended since the previous partial, so you can concatenate deltas directly.

With the filler filter enabled, the tail of the recognized text may be rewritten mid-segment, which breaks the append-only relationship. In that mode, treat segment.text as authoritative and overwrite the current segment on each message; use delta only as a hint.
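
The overwrite strategy works in both modes, so it is a reasonable default. A minimal sketch of a transcript accumulator follows; the `Transcript` class is illustrative and relies only on the message fields documented above.

```python
class Transcript:
    """Keeps committed segments plus the latest text of each open segment."""

    def __init__(self):
        self.final = []    # texts of committed segments, in arrival order
        self.partial = {}  # transcription_id -> latest segment.text

    def apply(self, msg):
        if msg.get("message_type") != "transcription":
            return
        tid = msg["transcription_id"]
        text = msg["segment"]["text"]
        if msg["is_eos"]:
            # The segment is final: drop the partial and commit the text.
            self.partial.pop(tid, None)
            self.final.append(text)
        else:
            # Overwrite instead of appending delta.text, so a rewritten
            # tail never leaves stale words behind.
            self.partial[tid] = text

    def render(self):
        return " ".join(self.final + list(self.partial.values()))


t = Transcript()
t.apply({"message_type": "transcription", "transcription_id": "a1",
         "is_eos": False, "segment": {"text": "Hello um world"}})
t.apply({"message_type": "transcription", "transcription_id": "a1",
         "is_eos": False, "segment": {"text": "Hello world how are"}})
print(t.render())
```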

`translated_transcription`

Sent only when translate_languages is set, once per target language, after each final (is_eos: true) transcription.

{
  "message_type": "translated_transcription",
  "transcription_id": "a1b2c3d4",
  "language": "es",
  "is_eos": true,
  "segment": {
    "text": "Hola mundo, ¿cómo estás?",
    "start_time": 0.32,
    "end_time": 1.84
  }
}

transcription_id matches the id of the final (is_eos: true) source transcription this translation was produced from, so you can use it to correlate a translation with its original segment. Here language is the target language, and is_eos is always true because translations are produced only for finalized segments.
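
One way to perform that correlation is to group messages by transcription_id. The sketch below uses an illustrative `correlate` function and does not assume any particular arrival order between a final transcription and its translations.

```python
def correlate(messages):
    """Group final source segments and their translations by transcription_id."""
    result = {}
    for msg in messages:
        kind = msg.get("message_type")
        if kind not in ("transcription", "translated_transcription"):
            continue
        if kind == "transcription" and not msg["is_eos"]:
            continue  # partials are never translated
        entry = result.setdefault(
            msg["transcription_id"], {"source": None, "translations": {}}
        )
        if kind == "transcription":
            entry["source"] = msg["segment"]["text"]
        else:
            # For translations, `language` is the target language.
            entry["translations"][msg["language"]] = msg["segment"]["text"]
    return result


pairs = correlate([
    {"message_type": "transcription", "transcription_id": "a1b2c3d4",
     "language": "en", "is_eos": True,
     "segment": {"text": "Hello world, how are you?"}},
    {"message_type": "translated_transcription", "transcription_id": "a1b2c3d4",
     "language": "es", "is_eos": True,
     "segment": {"text": "Hola mundo, ¿cómo estás?"}},
])
print(pairs["a1b2c3d4"])
```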

Errors

Authentication and routing failures are reported as HTTP status codes during the WebSocket upgrade, before the connection is established:

HTTP status	Meaning
`401`	Missing or invalid token
`409`	A session is already active for this identity

After a successful upgrade, the server does not send application-level error messages over the wire — it closes the connection with a standard WebSocket close frame.
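
A client can map the two documented upgrade failures to a recovery action. The sketch below is illustrative only: the action names (`refresh_token`, `wait_and_retry`, `give_up`) and the policy behind them are assumptions, not part of the API, and how you obtain the HTTP status depends on your WebSocket client library.

```python
# Documented handshake failures, each paired with a hypothetical recovery action.
UPGRADE_ERRORS = {
    401: ("refresh_token", "Missing or invalid token"),
    409: ("wait_and_retry", "A session is already active for this identity"),
}


def classify_upgrade_failure(status):
    """Return (action, meaning) for an HTTP status received during the upgrade."""
    return UPGRADE_ERRORS.get(
        status, ("give_up", f"Unexpected HTTP status {status}")
    )


action, meaning = classify_upgrade_failure(401)
print(action, "-", meaning)
```

Because no application-level error messages are sent after the upgrade, a close frame received mid-stream has to be handled separately, for example by inspecting the close code your client exposes.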

Complete example

Streams microphone audio and prints transcriptions. Translations are printed as well if you add translate_languages to the connection URL; the example as written does not request them.

pip install pyaudio websockets requests
export PALABRA_CLIENT_ID=...      # from Step 1
export PALABRA_CLIENT_SECRET=...
export PALABRA_LANGUAGE=en        # source language

import json
import os
import asyncio
import threading
import queue

import pyaudio
import requests
import websockets

WS_URL = "wss://api.palabra.ai/asr/v1/speech-to-text/stream"
SESSION_URL = "https://api.palabra.ai/session-storage/session"
LANGUAGE = os.environ.get("PALABRA_LANGUAGE", "en")

SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK = 5120  # samples ≈ 320 ms at 16 kHz (recommended chunk size)


def get_token() -> str:
    resp = requests.post(
        SESSION_URL,
        json={"data": {
            "subscriber_count": 0,
            "publisher_count": 1,
            "publisher_can_subscribe": True,
        }},
        headers={
            "ClientId": os.environ["PALABRA_CLIENT_ID"],
            "ClientSecret": os.environ["PALABRA_CLIENT_SECRET"],
        },
    )
    resp.raise_for_status()
    return resp.json()["data"]["publisher"]


def mic_reader(audio_queue: queue.Queue, stop_event: threading.Event):
    pa = pyaudio.PyAudio()
    stream = pa.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=CHUNK,
    )
    print("Microphone open, speak now...")
    try:
        while not stop_event.is_set():
            audio_queue.put(stream.read(CHUNK, exception_on_overflow=False))
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()


async def stream(token: str):
    url = (
        f"{WS_URL}?token={token}&language={LANGUAGE}"
        f"&format=pcm_s16le&sample_rate={SAMPLE_RATE}"
    )

    audio_queue: queue.Queue = queue.Queue()
    stop_event = threading.Event()
    threading.Thread(
        target=mic_reader, args=(audio_queue, stop_event), daemon=True
    ).start()

    async with websockets.connect(url) as ws:
        print("Connected")

        async def send_audio():
            loop = asyncio.get_running_loop()
            while True:
                data = await loop.run_in_executor(None, audio_queue.get)
                await ws.send(data)  # raw binary frame

        async def receive():
            async for message in ws:
                msg = json.loads(message)
                msg_type = msg.get("message_type")

                if msg_type == "transcription":
                    text = msg["segment"]["text"]
                    tid = msg.get("transcription_id", "")
                    if msg.get("is_eos"):
                        print(f"\n[EOS] {text} [{tid}]")
                    else:
                        # segment.text is the source of truth — render it whole
                        print(f"\r      {text}", end="", flush=True)

                elif msg_type == "translated_transcription":
                    lang = msg.get("language", "?")
                    tid = msg.get("transcription_id", "")
                    print(f"\n[{lang}] {msg['segment']['text']} [{tid}]")

        try:
            await asyncio.gather(send_audio(), receive())
        finally:
            stop_event.set()


if __name__ == "__main__":
    token = get_token()
    print("Session created")
    try:
        asyncio.run(stream(token))
    except KeyboardInterrupt:
        print("\nStopped.")
